for GdPicture.NET

    Document classification and extraction in C#

    This tutorial walks through the steps required to build a solution that classifies documents and extracts data from them.


    Use case definition

    ACME's CRM system handles a variety of document types, such as invoices, resumes, purchase orders, and payroll statements. The company needs an automated solution that categorizes these documents and extracts the relevant data for each category.

    Prerequisites

     -> Check the Prerequisites page.


    Setting up document templates

    Create a collection of DocumentTemplate objects.

    Each template is the complete definition of one document type, and the same definitions are used by both the classification and the extraction processes.

     

    XtractFlow Document templates selection in csharp

    static List<DocumentTemplate> setupDocumentTemplates()
    {
        List<DocumentTemplate> templates = new List<DocumentTemplate>();
        templates.Add(DocumentTemplates.Invoice); //adding invoice preset.
        templates.Add(DocumentTemplates.Resume); //adding resume preset.
        templates.Add(DocumentTemplates.PurchaseOrder); //adding purchase order preset.
        templates.Add(DocumentTemplates.PayrollStatement); //adding payroll statement preset.
        return templates;
    }

    Building the component

    Create a ProcessorComponent object, which the document processor requires.

    This object encapsulates the logic of the document processing workflow.

     

    XtractFlow ProcessorComponent generation in csharp
    static ProcessorComponent buildComponent()
    {
        return new ProcessorComponent()
        {
            EnableClassifier = true, // enabling classification.
            EnableFieldsExtraction = true, // enabling extraction.
            Templates = setupDocumentTemplates()
        };
    }

    Processing the documents

    Next, instantiate a DocumentProcessor object and call its Process method to start the inference process.

    The method returns a ProcessorResult object that holds the processing outcome.

     

    Using XtractFlow DocumentProcessor in csharp
    // building the component
    ProcessorComponent component = buildComponent();
    // processing all documents
    foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
    {
        ProcessorResult result = new DocumentProcessor().Process(documentFile, component);
        // analyzing results
        if (result.Template != null)
        {
            Console.WriteLine("Document category: " + result.Template.Name);
            if (result.ExtractedFields != null)
            {
                foreach (var item in result.ExtractedFields)
                {
                    Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
                }
            }
        }
    }
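
    Directory.GetFiles returns every file in the folder, and the loop above stops at the first exception. The sketch below is one possible way to skip files that are unlikely to be documents and to keep the batch running when a single file fails. It uses only standard .NET types. The BatchRunner class, its extension allow-list, and the processOne delegate are hypothetical names introduced for illustration; they are not part of XtractFlow, and the list of extensions is an assumption rather than a statement of what the library supports.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of XtractFlow.
static class BatchRunner
{
    // Assumed allow-list; adjust it to the formats your deployment accepts.
    static readonly HashSet<string> CandidateExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"
        };

    // True when the file extension is on the allow-list (case-insensitive).
    public static bool IsCandidate(string path) =>
        CandidateExtensions.Contains(Path.GetExtension(path));

    // Calls processOne for each candidate file. A failure on one file is
    // recorded and does not stop the remaining files from being processed.
    // Returns a map from each failed file to its error message.
    public static Dictionary<string, string> ProcessAll(
        IEnumerable<string> files, Action<string> processOne)
    {
        var failures = new Dictionary<string, string>();
        foreach (string file in files.Where(IsCandidate))
        {
            try
            {
                processOne(file);
            }
            catch (Exception ex)
            {
                failures[file] = ex.Message;
            }
        }
        return failures;
    }
}
```

    With this helper, the body of the foreach loop above could be passed as the processOne delegate, for example BatchRunner.ProcessAll(Directory.GetFiles(path), file => { ... }), and the returned dictionary reviewed once the batch finishes.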

    The complete solution

    Using XtractFlow to achieve classification and data extraction
    static void runExtraction()
    {
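        // GDPICTURE_KEY and OPENAI_KEY are placeholders for your own license and API keys.
        Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
        Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));
        // Folder that contains the resource files.
        Configuration.ResourcesFolder = "resources";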
        // building the component
        ProcessorComponent component = buildComponent();
        // processing all documents
        foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
        {
            ProcessorResult result = new DocumentProcessor().Process(documentFile, component);
            // analyzing results
            if (result.Template != null)
            {
                Console.WriteLine("Document category: " + result.Template.Name);
                if (result.ExtractedFields != null)
                {
                    foreach (var item in result.ExtractedFields)
                    {
                        Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
                    }
                }
            }
        }       
    }
    
    static ProcessorComponent buildComponent()
    {
        return new ProcessorComponent()
        {
            EnableClassifier = true, // enabling classification.
            EnableFieldsExtraction = true, // enabling extraction.
            Templates = setupDocumentTemplates()
        };
    }
    
    static List<DocumentTemplate> setupDocumentTemplates()
    {
        List<DocumentTemplate> templates = new List<DocumentTemplate>();
        templates.Add(DocumentTemplates.Invoice); //adding invoice preset.
        templates.Add(DocumentTemplates.Resume); //adding resume preset.
        templates.Add(DocumentTemplates.PurchaseOrder); //adding purchase order preset.
        templates.Add(DocumentTemplates.PayrollStatement); //adding payroll statement preset.
        return templates;
    }
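
    The complete solution passes the license key and the OpenAI key as placeholders. One common way to keep such keys out of source code is to read them from environment variables at startup. The following is a minimal sketch that uses only standard .NET calls. The Secrets class is a hypothetical helper, not part of XtractFlow, and the environment variable names are assumptions.

```csharp
using System;

// Hypothetical helper, not part of XtractFlow.
static class Secrets
{
    // Returns the value of the named environment variable.
    // Throws InvalidOperationException when it is unset or blank,
    // so a missing key is reported before any document is processed.
    public static string Require(string name)
    {
        var value = Environment.GetEnvironmentVariable(name);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{name}' is not set.");
        }
        return value;
    }
}
```

    Assuming the keys are stored in variables named GDPICTURE_KEY and OPENAI_KEY, runExtraction could then call Configuration.RegisterGdPictureKey(Secrets.Require("GDPICTURE_KEY")) and pass Secrets.Require("OPENAI_KEY") to the OpenAIProvider constructor.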